The Memory-Centric Nature of GPU Performance
AI024 Lesson 5

In GPU acceleration, we must abandon the "compute-first" mindset. Modern performance is dictated by memory management: the orchestration of data allocation, transfer, synchronization, and optimization between the host (CPU) and the device (GPU).

1. The Memory-Compute Disparity

While GPU arithmetic throughput (TFLOPS) has skyrocketed, memory bandwidth (GB/s) has grown at a much slower rate. This creates a gap where the execution units are often "starved," waiting for data to arrive from VRAM. Consequently, GPU programming is often memory programming.
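One way to quantify this gap is the "ridge point": the arithmetic intensity at which a kernel stops being limited by bandwidth and starts being limited by compute. The sketch below uses illustrative, roughly A100-class numbers; substitute your own device's datasheet values.

```python
# Sketch: estimating the ridge point where a GPU stops being memory-bound.
# The peak and bandwidth figures below are illustrative (roughly A100-class),
# not authoritative specs.

peak_tflops = 19.5          # FP32 peak throughput, TFLOPS (illustrative)
hbm_bandwidth_gbs = 1555    # HBM memory bandwidth, GB/s (illustrative)

# Arithmetic intensity (FLOPs/Byte) needed to saturate the compute units:
ridge_point = (peak_tflops * 1e12) / (hbm_bandwidth_gbs * 1e9)
print(f"Ridge point: {ridge_point:.1f} FLOPs/Byte")
```

Any kernel whose arithmetic intensity falls below this ridge point spends more time waiting on memory than computing, no matter how fast the ALUs are.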

2. The Roofline Model

This model visualizes the relationship between Arithmetic Intensity (FLOPs/Byte) and performance. Applications typically fall into two categories:

  • Memory-Bound: Limited by bandwidth (the sloped region of the roofline).
  • Compute-Bound: Limited by peak TFLOPS (the horizontal ceiling).
[Figure: Roofline model — x-axis: Arithmetic Intensity (FLOPs/Byte); y-axis: Performance (GFLOPS); sloped memory-bound region meeting the flat compute-bound ceiling.]
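The roofline itself reduces to a single expression: attainable performance is the minimum of the bandwidth-scaled intensity and the peak compute rate. A minimal sketch, with illustrative hardware numbers:

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gbs):
    """Roofline model: performance is capped by either the memory roof
    (ai * bandwidth) or the compute roof (peak_gflops)."""
    return min(peak_gflops, ai * bandwidth_gbs)

# Illustrative device: 19,500 GFLOPS peak, 1,555 GB/s bandwidth.
low_ai  = attainable_gflops(4,   19500, 1555)   # memory-bound kernel
high_ai = attainable_gflops(100, 19500, 1555)   # compute-bound kernel
print(low_ai, high_ai)
```

At an intensity of 4 FLOPs/Byte the kernel sits on the sloped memory roof (6,220 GFLOPS); at 100 FLOPs/Byte it hits the flat compute ceiling (19,500 GFLOPS).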

3. The Tax of Data Movement

The primary performance bottleneck is rarely the math; it is the latency and energy cost of moving a byte across the PCIe bus or from HBM. High-performance code prioritizes data residence and minimizes host-device transfers.
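A back-of-envelope model makes this tax concrete: compare the time to move one gigabyte over the PCIe bus with the time to read it from on-device HBM. The bandwidth figures are illustrative (PCIe 4.0 x16 at roughly 32 GB/s), not measured values.

```python
# Back-of-envelope model of the data-movement "tax" for one gigabyte.
# Bandwidth figures are illustrative, not measured.

n_bytes  = 1 << 30   # 1 GiB of data
pcie_gbs = 32        # host<->device over PCIe 4.0 x16 (illustrative)
hbm_gbs  = 1555      # on-device HBM bandwidth (illustrative)

pcie_ms = n_bytes / (pcie_gbs * 1e9) * 1e3
hbm_ms  = n_bytes / (hbm_gbs * 1e9) * 1e3
print(f"PCIe transfer: {pcie_ms:.1f} ms, HBM read: {hbm_ms:.2f} ms")
```

Under these assumptions, shipping the gigabyte across PCIe costs roughly 50x more time than reading it from HBM, which is why high-performance code keeps data resident on the device and amortizes each transfer over many kernels.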
